City of San Francisco Trees

Imagine you've been commissioned by the City of San Francisco to tackle a problem they've been having with local flora. The parks department has kept extensive records of the city's trees since the 1970s (what species are growing, where they are, who maintains them), amassing a dataset of over 200K trees in that time.

The funding for that project has recently been called into question, and the City Board needs to see its value before reapproving funds for the following year. Stakeholders have raised several concerns over the past few years, and your job is to use the data to answer them. Good luck!

Jupyter Notebook

First things first, let's get some terminology straight.

  • The language we're working in is Python 3.7.
  • The editor we're using is Google Colab: the code runs on Google's servers and shows the results in our browser.
  • This file is an interactive Python notebook (a .ipynb file), also known as a Jupyter notebook.

Jupyter notebooks have a few special properties that make them ideal for working with data:

  • Code is organized into cells, which can be code or markdown
  • We can run the cells in any order. Try it out!
  • The value of the last expression in a cell is displayed automatically, so there's no need to wrap it in print()
In [1]:
x = 'Answer to the Ultimate Question of Life, the Universe, and Everything'
In [2]:
print(x) # Run this cell after running the one above, and again after running the one below
In [3]:
x = 42
In [ ]:
def UltimateQuestion(computer_name):
    return computer_name + ' is thinking...'
In [ ]:
UltimateQuestion('DeepThought')

Importing packages

We use the pandas package to easily work with data as tables.
The numpy package allows us to work with some other special data types, like missing values.

We'll rename these as pd and np, just so it's easier to refer to them later on.

In [4]:
# `as` lets us give each package a shorter alias
import pandas as pd
import numpy as np
In [5]:
pd.options.display.max_rows = 5 # Just to shorten output
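
To make the "missing values" point concrete, here's a minimal sketch on a toy Series. The numbers are made up for illustration and are not taken from the trees data. It shows how np.nan marks a missing entry and how pandas skips it by default when averaging.

```python
import numpy as np
import pandas as pd

# Toy diameters; np.nan marks a tree whose measurement is missing
diameters = pd.Series([9.0, np.nan, 4.0])

n_missing = diameters.isna().sum()   # count the missing entries
mean_dbh = diameters.mean()          # NaN is skipped by default

print(n_missing, mean_dbh)
```

Passing skipna=False to mean() would return NaN instead, which can be handy when a missing value should not go unnoticed.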

Importing data

This semester, we'll typically work with data in tabular format, the kind you'd be used to seeing in an Excel spreadsheet. Data files saved in this format will usually have a .csv file ending, short for comma-separated values.

To import this, let's use the pd.read_csv() function:

In [6]:
# URL of the CSV file to load
url = 'https://raw.githubusercontent.com/ishaandey/node/master/week-1/workshop/trees.csv'
trees = pd.read_csv(url)

Here, we've saved the data to a DataFrame object named trees.
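
If the URL isn't reachable, the same function can be tried on a small piece of text. The sketch below is only an illustration: the three rows are made up, and io.StringIO is used to make a string behave like a file. It shows that the first line of a CSV becomes the column names and that an empty field is read as a missing value.

```python
import io
import pandas as pd

# A tiny, made-up CSV: the first line holds the column names
csv_text = """tree_id,common_name,dbh
1,Swamp Myrtle,3.0
2,Brisbane Box,
3,Cherry Plum,14.0
"""

# read_csv accepts a path, a URL, or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)              # (rows, columns)
print(toy.dbh.isna().sum())   # number of missing diameters
```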

In [7]:
type(trees)
Out[7]:
pandas.core.frame.DataFrame

DataFrames hold our data in little "spreadsheet"-like structures. Whatever manipulation you can think of doing to the data, a quick search will likely turn up how to do it in pandas.

Exploring dataframes

Let's take a look at the data. We'll use the methods .head() and .tail(), which return the first and last rows of a dataframe.

In [8]:
trees.head()
Out[8]:
tree_id legal_status caretaker dbh plot_size species_name common_name date site_location site_type latitude longitude address
9755 24126 Permitted Site Private 9.0 3X3 Melaleuca quinquenervia Cajeput 1997-05-07 Sidewalk Cutout 37.736081 -122.385931 140X Whitney Young Cir
9950 26287 DPW Maintained DPW 3.0 3X3 Tristaniopsis laurina Swamp Myrtle 1997-06-20 Sidewalk Cutout 37.725866 -122.442437 1900X San Jose Ave
22309 83491 DPW Maintained Private 4.0 3x3 Lagunaria patersonii Primrose Tree 2008-01-24 Sidewalk Cutout 37.749745 -122.507247 1966 48th Ave
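
Only .head() is run above, so here's a small sketch of both methods side by side. It uses a made-up ten-row table as a stand-in for trees; both methods return 5 rows unless a different number is passed.

```python
import pandas as pd

# Made-up stand-in for the trees table
toy = pd.DataFrame({"tree_id": range(1, 11),
                    "dbh": [float(i) for i in range(10)]})

first_three = toy.head(3)   # first 3 rows; head() alone returns 5
last_two = toy.tail(2)      # last 2 rows

print(first_three)
print(last_two)
```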

How big is the dataset? The .shape attribute returns a tuple with the dimensions as (rows, columns).

In [9]:
trees.shape
Out[9]:
(36073, 13)

Let's try to understand our data a bit better.

  • How many different tree species are represented in the dataset?
In [10]:
trees.species_name.nunique()
Out[10]:
367
  • Which tree shows up the most frequently?
In [11]:
trees.common_name.value_counts()
Out[11]:
Swamp Myrtle              2781
Brisbane Box              2751
                          ... 
Apple Tree 'Gala'            1
Autumn Sentinel Ginkgo       1
Name: common_name, Length: 365, dtype: int64
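
Since value_counts() sorts from most to least frequent, the answer is the first entry. As a hedged sketch on a handful of made-up names, here's one way to pull that entry out directly, along with the normalize=True option that reports shares instead of raw counts.

```python
import pandas as pd

# Made-up names, not the real column
names = pd.Series(
    ["Swamp Myrtle", "Brisbane Box", "Swamp Myrtle",
     "Cherry Plum", "Swamp Myrtle", "Brisbane Box"],
    name="common_name",
)

counts = names.value_counts()                # most frequent first
most_common = counts.idxmax()                # label with the highest count
share = names.value_counts(normalize=True)   # proportions instead of counts

print(most_common, counts.iloc[0], share.iloc[0])
```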

What are the biggest trees?
Note: dbh stands for diameter at breast height, the trunk diameter measured at a standard height above the ground.

In [12]:
trees.sort_values(by='dbh', ascending=False)
Out[12]:
tree_id legal_status caretaker dbh plot_size species_name common_name date site_location site_type latitude longitude address
34738 14513 DPW Maintained DPW 100.0 4X4 Fraxinus uhdei Shamel Ash: Evergreen Ash 2018-06-18 Sidewalk Cutout 37.776560 -122.446728 501 Masonic Ave
28183 12738 DPW Maintained DPW 100.0 4x4 Tristaniopsis laurina 'Elegant' Small-leaf Tristania 'Elegant' 2013-07-12 Sidewalk Cutout 37.786183 -122.477196 1630 Lake St
... ... ... ... ... ... ... ... ... ... ... ... ... ...
14796 44797 DPW Maintained Private 0.0 NaN Prunus serrulata Ornamental Cherry 2001-04-12 Sidewalk Cutout 37.765145 -122.480368 1206 22nd Ave
36072 144192 DPW Maintained Private 0.0 Width 4ft Lophostemon confertus Brisbane Box 2020-01-25 Sidewalk Cutout 37.776940 -122.502697 618 42nd Ave

36073 rows × 13 columns
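
When only the top few rows matter, nlargest() is a shorter alternative to sorting the whole table. This is a minimal sketch on made-up rows, not a replacement for the cell above.

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    "common_name": ["Cajeput", "Swamp Myrtle", "Primrose Tree", "Brisbane Box"],
    "dbh": [9.0, 3.0, 4.0, 0.0],
})

# Similar in spirit to sort_values(by="dbh", ascending=False).head(2)
biggest = toy.nlargest(2, "dbh")
smallest = toy.nsmallest(1, "dbh")

print(biggest)
print(smallest)
```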

Subsetting

Subsetting is a super helpful tool. We'll take a look at this in more depth next week, but for now, here are the basics:

We can filter rows from a dataframe based on some condition:

  • Show only trees north of Golden Gate Park (latitude > 37.77285)
  • Show only Cherry Plum trees
In [13]:
trees[trees.latitude > 37.77285]
Out[13]:
tree_id legal_status caretaker dbh plot_size species_name common_name date site_location site_type latitude longitude address
1 30321 DPW Maintained Private 2.0 NaN Magnolia grandiflora Southern Magnolia 1956-01-06 Sidewalk Cutout 37.795718 -122.441860 2828 Divisadero St
5 30339 DPW Maintained Private 11.0 NaN Platanus x hispanica Sycamore: London Plane 1956-02-15 Sidewalk Cutout 37.793189 -122.441380 2560 Divisadero St
... ... ... ... ... ... ... ... ... ... ... ... ... ...
36071 144157 DPW Maintained Private 0.0 Width 4ft Tristaniopsis laurina Swamp Myrtle 2020-01-25 Sidewalk Cutout 37.774642 -122.501452 746 41st Ave
36072 144192 DPW Maintained Private 0.0 Width 4ft Lophostemon confertus Brisbane Box 2020-01-25 Sidewalk Cutout 37.776940 -122.502697 618 42nd Ave

15811 rows × 13 columns

In [14]:
trees[trees.common_name == 'Cherry Plum']
Out[14]:
tree_id legal_status caretaker dbh plot_size species_name common_name date site_location site_type latitude longitude address
149 53700 Permitted Site Private 14.0 NaN Prunus cerasifera Cherry Plum 1970-03-04 Sidewalk Cutout 37.746081 -122.426025 263 Duncan St
198 54020 DPW Maintained Private 13.0 NaN Prunus cerasifera Cherry Plum 1972-04-07 Sidewalk Cutout 37.772780 -122.494875 862 35th Ave
... ... ... ... ... ... ... ... ... ... ... ... ... ...
35700 246210 DPW Maintained Private 3.0 Width 0ft Prunus cerasifera Cherry Plum 2019-10-01 Sidewalk Cutout 37.767967 -122.443800 725 Buena Vista Ave West
35701 246211 DPW Maintained Private 3.0 Width 0ft Prunus cerasifera Cherry Plum 2019-10-01 Sidewalk Cutout 37.767917 -122.443821 725 Buena Vista Ave West

1180 rows × 13 columns
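
The two filters above can also be combined. Here's a small sketch on made-up rows (the coordinates are invented for illustration). Each condition is a boolean Series, and pandas uses & for "and", | for "or", and ~ for "not" instead of the Python keywords.

```python
import pandas as pd

# Made-up rows; coordinates are for illustration only
toy = pd.DataFrame({
    "common_name": ["Cherry Plum", "Cherry Plum", "Swamp Myrtle", "Brisbane Box"],
    "latitude": [37.746081, 37.780000, 37.774642, 37.725866],
})

north = toy.latitude > 37.77285            # boolean Series, one value per row
cherry = toy.common_name == "Cherry Plum"

north_cherry = toy[north & cherry]         # both conditions hold
north_or_cherry = toy[north | cherry]      # at least one holds
not_cherry = toy[~cherry]                  # condition negated

print(len(north_cherry), len(north_or_cherry), len(not_cherry))
```

When the conditions are written inline, each one needs its own parentheses, for example toy[(toy.latitude > 37.77285) & (toy.common_name == 'Cherry Plum')].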

Data Manipulation

Another common task is to find patterns based on groups.

  • Which tree type, on average, has the largest diameter?
In [15]:
# Select dbh before averaging so text columns are never passed to mean()
trees.groupby('common_name')['dbh'].mean().sort_values(ascending=False).head()
Out[15]:
common_name
Date palm (species unknown)    70.000000
False Avocado                  35.000000
Canary Island Date Palm        30.912664
Flooded Box: Coolibah          30.000000
Morton Bay Fig                 29.000000
Name: dbh, dtype: float64
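
An average like 70.000000 may come from a very small group, possibly a single tree. One way to check, sketched here on made-up rows, is to compute the count next to the mean and keep only groups above some size. The threshold min_trees is a hypothetical choice for illustration, not something set by the dataset.

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    "common_name": ["Date Palm", "Cherry Plum", "Cherry Plum",
                    "Cherry Plum", "Swamp Myrtle", "Swamp Myrtle"],
    "dbh": [70.0, 14.0, 13.0, 3.0, 3.0, 5.0],
})

# One row per tree type, with the mean diameter and the number of trees
summary = toy.groupby("common_name")["dbh"].agg(["mean", "count"])

# min_trees is a hypothetical threshold chosen for this sketch
min_trees = 2
reliable = (summary[summary["count"] >= min_trees]
            .sort_values("mean", ascending=False))

print(reliable)
```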

Visualization

First things first, let's import plotly, the package that will help us visualize the data.

If this package isn't installed yet, we can install it using !pip install plotly. More on this in week 5.

In [16]:
import plotly.express as px

## Uncomment & run the following if graphs don't show
# import plotly.io as pio
# pio.renderers.default='notebook'

Note that we're using a subpackage of the broader plotly package, called plotly express. This simplifies a lot of the more difficult steps.

Plotly express has a broad range of options to play with, so let's take a look at the documentation.
Do a quick Google search to pull up the documentation for px.scatter, or run px.scatter? in a Jupyter cell.

In [17]:
px.scatter?
In [18]:
trees_sample = trees.sample(frac=.2)
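
Because .sample() draws rows at random, the plots will look slightly different on every run. A minimal sketch of making the draw repeatable, using a made-up table and an arbitrary seed value:

```python
import pandas as pd

# Made-up stand-in for the trees table
toy = pd.DataFrame({"tree_id": range(100)})

# frac=.2 keeps 20% of the rows; random_state (any fixed integer)
# makes the same rows come back on every run
sample_a = toy.sample(frac=.2, random_state=0)
sample_b = toy.sample(frac=.2, random_state=0)

print(len(sample_a), sample_a.equals(sample_b))
```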
In [19]:
fig = px.scatter(trees_sample, x='date', y='dbh')
fig.show('notebook')

There aren't any obvious trends in this view, so let's add in some more parameters.

In [20]:
fig = px.scatter(trees_sample, x='date', y='dbh', 
                 opacity=.15, color='site_location', 
                 hover_name='common_name', hover_data=['site_location','site_type','address'],
                 marginal_x = 'histogram', marginal_y = 'histogram',
                 color_discrete_sequence = px.colors.qualitative.Prism[4:]
                )
fig.show('notebook')
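
A numeric summary can complement the scatter plot when looking for a trend over time. The sketch below assumes the date column was loaded as plain text, which is what read_csv does unless told otherwise. It converts the column with pd.to_datetime and takes the median dbh per year. The rows are made up for illustration.

```python
import pandas as pd

# Made-up rows; dates are stored as text, as read_csv would load them
toy = pd.DataFrame({
    "date": ["1997-05-07", "1997-06-20", "2008-01-24", "2008-03-02"],
    "dbh": [9.0, 3.0, 4.0, 6.0],
})

toy["date"] = pd.to_datetime(toy["date"])   # text -> datetime64

# Group by the year part of each date and take the median diameter
median_by_year = toy.groupby(toy["date"].dt.year)["dbh"].median()

print(median_by_year)
```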

Geographic Plots

The transportation department wants to track any trees sitting on a road median, in order to quickly remove debris after a bad storm.
Is there a general area with more roadside / median trees?

In [21]:
fig = px.scatter_mapbox(trees_sample, lat='latitude', lon='longitude', 
                        color='site_location', size='dbh', opacity=.4,
                        color_discrete_sequence=px.colors.qualitative.Prism[4:],
                        hover_name='address',hover_data=['common_name','site_location','caretaker'],
                        # token-free base map; Stamen styles may now require a Stadia Maps account
                        zoom=11, mapbox_style="open-street-map",
                       )
fig.show('notebook')
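
The map can be cross-checked with a quick count. The sketch below is only a rough illustration on made-up rows: the category label "Median" is an assumed value for site_location and should be checked against trees.site_location.value_counts() first. It rounds the coordinates to two decimals (cells of roughly 1 km) and counts the median trees in each cell.

```python
import pandas as pd

# Made-up rows; "Median" is an assumed category label
toy = pd.DataFrame({
    "site_location": ["Sidewalk", "Median", "Median", "Sidewalk", "Median"],
    "latitude": [37.7361, 37.7762, 37.7768, 37.7259, 37.7497],
    "longitude": [-122.3859, -122.4467, -122.4471, -122.4424, -122.5072],
})

medians = toy[toy.site_location == "Median"]

# Round coordinates to 2 decimals and count trees per grid cell
cells = medians.groupby(
    [medians.latitude.round(2), medians.longitude.round(2)]
).size()

busiest = cells.idxmax()   # (latitude, longitude) of the fullest cell

print(cells)
print(busiest)
```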